Advanced Statistical Inference I

Chelsea Parlett-Pelleriti

Bayesian Regression Models

Regression (GLM) Model Review

\[ \mathbb{E}\left (y | \mathbf{X}\right ) = \mu = g^{-1}\left (\mathbf{X}\beta \right ) \\ y \mid \mathbf{X} \sim \pi\left (\mu,...\right ) \]

  • link function: \(g()\)

  • likelihood function: \(\pi()\)
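As a concrete (non-Bayesian) illustration of these two pieces, a Poisson regression uses a log link \(g\) and a Poisson likelihood \(\pi\); the data and coefficients below are invented for the sketch:

```r
# GLM pieces: link g() and likelihood pi().
# Hypothetical example: Poisson regression with a log link, so
# E(y | X) = mu = g^{-1}(X beta) = exp(X beta).
set.seed(1)
n  <- 200
x  <- rnorm(n)
mu <- exp(0.5 + 0.8 * x)        # inverse link applied to the linear predictor
y  <- rpois(n, lambda = mu)     # likelihood: y | x ~ Poisson(mu)

fit <- glm(y ~ x, family = poisson(link = "log"))
coef(fit)                       # estimates of beta_0 and beta_1
```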

Linear Regression Review

\[ y_i \sim \mathcal{N}(\mu_i, \sigma^2) \\ \mu_i = \mathbf{X_i}\beta = \beta_0 + \beta_1 \cdot x_{i1} + ... + \beta_n \cdot x_{in} \]

Bayesian Linear Regression

\[ y_i \sim \mathcal{N}(\mu_i, \color{red}\sigma^2) \\ \mu_i = \mathbf{X_i}\color{red}\beta = \color{red}\beta_0 + \color{red}\beta_1 \cdot x_{i1} + ... + \color{red}\beta_n \cdot x_{in} \]

Setting Priors

Choosing BLR Priors

Let’s look at a regression model that uses age to predict income (in thousands).

\[ \underbrace{\mu_i}_{\mathbb{E}(\text{income})} = \color{red}{\beta_0}+ \color{red}{\beta_1} \cdot age_i \\ \text{income}_i \sim \mathcal{N}(\mu_i, \color{red}{\sigma^2}) \]

\[ \beta_0 \sim \mathcal{N}(m_0, s_0^2); \beta_1 \sim \mathcal{N}(m_1, s_1^2);\sigma \sim \Gamma(0.1,10) \]

❓ What other distributions might be reasonable for our two \(\beta\)s?

Choosing BLR Priors

\(\beta_0 \sim \mathcal{N}(0, 1); \beta_1 \sim \mathcal{N}(2, 4);\sigma \sim \Gamma(0.1,10)\)

We put priors on (at least) all parameters in our model.
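One way to sanity-check priors like these is to simulate from them before fitting (a prior predictive check). A minimal base-R sketch using the prior values from the slide, reading \(\mathcal{N}(2, 4)\) as variance 4 (sd 2):

```r
# Prior predictive sketch for the income ~ age model: draw parameters from
# the priors, then draw implied incomes (in thousands) for a 40-year-old.
set.seed(540)
n_sims <- 1000
beta0  <- rnorm(n_sims, mean = 0, sd = 1)
beta1  <- rnorm(n_sims, mean = 2, sd = 2)      # N(2, 4): variance 4 -> sd 2
sigma  <- rgamma(n_sims, shape = 0.1, rate = 10)
mu     <- beta0 + beta1 * 40                   # expected income at age 40
income <- rnorm(n_sims, mean = mu, sd = sigma)
quantile(income, c(0.05, 0.5, 0.95))           # implied range: is it plausible?
```

If the implied range is absurd (e.g. mostly negative incomes), the priors deserve another look before any data are involved.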

Uniform Priors

If we do not set priors, most software will assume default priors. In some cases, this means Uninformative Uniform Priors.

Uniform Priors

\[ \text{Posterior} \propto \text{Likelihood} \times \text{Prior} \]

If your prior is uniform, it multiplies every possible value of \(\theta\) (the parameter in question) by the same constant. So…

\[ \text{Posterior} \propto \text{Likelihood} \]
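This proportionality is easy to verify with a grid approximation; the Bernoulli example below is made up:

```r
# With a flat prior, the normalized posterior equals the normalized
# likelihood. Toy example: theta = success probability, 7 successes in 10.
theta <- seq(0.001, 0.999, length.out = 999)
lik   <- dbinom(7, size = 10, prob = theta)   # likelihood on the grid
prior <- rep(1, length(theta))                # uniform prior
post  <- lik * prior
post  <- post / sum(post)                     # normalize the posterior
lik_n <- lik / sum(lik)                       # normalize the likelihood
all.equal(post, lik_n)                        # TRUE: identical curves
```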

Uniform Priors

Pros:

  • “Objective” 🙄

  • lets the “data speak for themselves”

  • easy to explain

  • sometimes gives results roughly equivalent to Frequentist estimates

Cons:

  • you often know something (non-regularizing)

  • non-invariance under re-parameterization

  • improper over \(\mathbb{R}^n\) (does not integrate to 1)

Uniform Priors

library(brms)
# freq
lm(y ~ x, data = df) |> summary()

Call:
lm(formula = y ~ x, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.82484 -0.59236 -0.03289  0.64124  2.88284 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.08791    0.09502   0.925    0.357    
x           -0.52576    0.09892  -5.315  6.7e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9501 on 98 degrees of freedom
Multiple R-squared:  0.2238,    Adjusted R-squared:  0.2158 
F-statistic: 28.25 on 1 and 98 DF,  p-value: 6.703e-07

Uniform Priors (Bayes)

# bayes
priors <- c(
  prior("", class = "b"),
  prior("", class = "Intercept"),
  prior("uniform(0, 1e6)", class = "sigma")
)
brm(y ~ x, data = df, prior = priors, verbose = 2) |> summary()
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: y ~ x 
   Data: df (Number of observations: 100) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.09      0.10    -0.10     0.28 1.00     4245     2895
x            -0.52      0.10    -0.72    -0.33 1.00     4159     2776

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.96      0.07     0.84     1.10 1.00     4471     3159

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Uniform Priors (Non-Invariance)

Consider a logistic regression:

\[ \underbrace{\mathbb{E}\left( y | \mathbf{X} \right)}_\text{pred prob} = \frac{1}{1 + e^{-\mathbf{X}\beta}} \]

  • Parameterization 1: estimate log odds coefficients \(\beta\)s

  • Parameterization 2: estimate odds coefficients \(e^{\beta}\)

Uniform Priors (Non-Invariance)

Parameterization 1

Let’s feign ignorance…we know nothing about what \(\beta\)s should be, so we place a uniform prior on the coefficients.

Uniform Priors (Non-Invariance)

Parameterization 2

By the same logic…we know nothing about what \(e^{\beta}\)s should be, so we place a uniform prior on the coefficients.

Uniform Priors (Non-Invariance)

First, let’s look at what Parameterization 1 (uniform prior on \(\beta\)) implies about \(e^\beta\):


\[ p_{\beta}(\beta) = \frac{1}{upper - lower} = \frac{1}{b-a} \]

let \(\theta = e^\beta\) so \(\beta = ln(\theta)\):

\[ \underbrace{p_\theta(\theta) = p_\beta(\beta) \cdot \lvert\frac{d \beta}{d \theta}\rvert}_\text{change of variable} \\ p_\theta(\theta) = \frac{1}{b-a} \cdot \lvert\frac{1}{\theta}\rvert = \frac{1}{(b-a)\theta} \]

❗ this does not give all \(\theta\)s the same prior density!!!
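A quick Monte Carlo check of this (the bounds \(a = 0\), \(b = 2\) are chosen arbitrarily):

```r
# A uniform prior on beta does NOT stay uniform after theta = exp(beta).
set.seed(1)
beta  <- runif(1e5, min = 0, max = 2)   # flat prior on beta
theta <- exp(beta)                      # implied draws of theta in (1, e^2)
mid   <- (1 + exp(2)) / 2               # midpoint of theta's range
mean(theta < mid)                       # ~0.72, not 0.5: theta is not uniform
```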

Uniform Priors (Non-Invariance)

Now, let’s look at what Parameterization 2 (uniform prior on \(e^\beta\)) implies about \(\beta\):

\[ p_{\theta}(\theta) = \frac{1}{upper - lower} = \frac{1}{b-a} \]

\(\beta = ln(\theta)\) so \(\theta = e^\beta\):

\[ \underbrace{p_\beta(\beta) = p_\theta(\theta) \cdot \lvert\frac{d \theta}{d \beta}\rvert}_\text{change of variable} \\ p_\beta(\beta) = \frac{1}{b-a} \cdot \lvert e^\beta\rvert = \frac{e^\beta}{(b-a)} \]


Jeffreys Prior

\[ p(\theta) \propto \sqrt{I(\theta)} \]

  • Jeffreys Prior: a method of constructing non-informative priors that are invariant under re-parameterization

(see Bayesian Data Analysis 2.8 for more math)
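For instance, for a single Bernoulli(\(\theta\)) observation, \(I(\theta) = \frac{1}{\theta(1-\theta)}\), so the Jeffreys prior is

\[ p(\theta) \propto \sqrt{I(\theta)} = \theta^{-1/2}(1-\theta)^{-1/2}, \]

i.e. a \(\text{Beta}(1/2, 1/2)\) distribution, and it transforms consistently under any re-parameterization of \(\theta\).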

Fisher Information

Slightly Mathy Idea: Fisher Information measures how sensitive the log-likelihood function \(\ell(\theta | X)\) is to changes in \(\theta\) (more sensitive \(\to\) more information).

Fisher Information

Math Idea:

\[ I_X(\theta) = -\mathbb{E}\left[\frac{\partial^2 \ell(\theta | X)}{\partial\theta^2} \right] \]

where \(\ell(\theta | X)\) is the log-likelihood of \(\theta\) given \(X\). If \(\ell\) is sensitive to changes in \(\theta\), the magnitude of the second derivative will be large and we expect high information.

Note: \(\ell\) is usually concave down around the maximum likelihood estimate, so the second derivative is negative, hence the negative sign in front of the expectation.
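The definition can be checked numerically. The sketch below approximates the second derivative by finite differences for a single Bernoulli(\(\theta\)) observation, where the analytic answer is \(I(\theta) = \frac{1}{\theta(1-\theta)}\):

```r
# Numeric Fisher information for one Bernoulli(theta) draw.
theta <- 0.3
h     <- 1e-4
loglik <- function(t, y) y * log(t) + (1 - y) * log(1 - t)
# central-difference second derivative of the log-likelihood at theta
d2 <- function(y) (loglik(theta + h, y) - 2 * loglik(theta, y) +
                     loglik(theta - h, y)) / h^2
# I(theta) = -E[d2], expectation taken over y ~ Bernoulli(theta)
info <- -(theta * d2(1) + (1 - theta) * d2(0))
c(numeric = info, analytic = 1 / (theta * (1 - theta)))   # both ~4.76
```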

Uniform Priors (Regularization)

Regularization (in general): discourages overfitting and keeps models simpler

Regularizing Priors: keep parameter values in a “reasonable range”

Uniform Priors (Regularization)

Ridge regression corresponds to a Normal prior on the coefficients; the lasso corresponds to a Laplace (double-exponential) prior.
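A base-R sketch of the ridge case (data and prior scale invented for illustration): with \(y \sim \mathcal{N}(\mathbf{X}\beta, \sigma^2)\) and \(\beta_j \sim \mathcal{N}(0, \tau^2)\), the posterior mode equals the ridge estimate with penalty \(\lambda = \sigma^2/\tau^2\):

```r
# MAP under a Normal prior == ridge regression (no intercept, sigma known).
set.seed(2)
n <- 100
X <- cbind(rnorm(n), rnorm(n))
beta_true <- c(1, -0.5); sigma <- 1; tau <- 0.5
y <- X %*% beta_true + rnorm(n, sd = sigma)

lambda <- sigma^2 / tau^2
ridge  <- solve(t(X) %*% X + lambda * diag(2), t(X) %*% y)  # closed form

# MAP: minimize the negative log posterior (Gaussian likelihood + Normal prior)
neg_log_post <- function(b)
  sum((y - X %*% b)^2) / (2 * sigma^2) + sum(b^2) / (2 * tau^2)
map <- optim(c(0, 0), neg_log_post, method = "BFGS")$par
cbind(ridge = as.vector(ridge), map = map)  # the two columns agree
```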

Other Types of Priors

  • Uninformative: all possible values of \(\theta\) have the same relative prior likelihood

  • Weakly Informative: regularize our estimates; extreme values are less likely, but not much other info is encoded

  • Strongly Informative: prior is narrow around known likely values

Choosing Prior Distributions

  • think about the range of possible values of \(\theta\)

  • think about amount of prior uncertainty

  • think about regularization

In reality, we often report two or three versions of the model with different priors to see how those priors change the inferences we make.

Setting Priors in brms

set.seed(540)
n <- 100
x <- rnorm(n)
y <- 0.25 - x*0.5 + rnorm(n)
df <- data.frame(x,y)

# bayes
priors <- c(
  prior("normal(0,5)", class = "b"),
  prior("normal(0,4)", class = "Intercept"),
  prior("gamma(0.1, 10)", class = "sigma")
)
brm(y ~ x, data = df, prior = priors, verbose = 2) |> summary()

CHECKING DATA AND PREPROCESSING FOR MODEL 'anon_model' NOW.

COMPILING MODEL 'anon_model' NOW.

STARTING SAMPLER FOR MODEL 'anon_model' NOW.

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 2e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.2 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.008 seconds (Warm-up)
Chain 1:                0.006 seconds (Sampling)
Chain 1:                0.014 seconds (Total)
Chain 1: 

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: y ~ x 
   Data: df (Number of observations: 100) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.25      0.10     0.06     0.44 1.00     4512     2874
x            -0.49      0.11    -0.70    -0.28 1.00     4252     2718

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.98      0.07     0.87     1.13 1.00     3547     2877

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).